I <3 Kernels
In this post I’m going to go through the kernel trick and how it helps or enables various tools in statistics and machine learning, including support vector machines, Gaussian processes, kernel regression and kernel PCA. This is going to be a bit of a long one; I’ll probably split it up later, but for now … sorry?
Table of Contents
- Linear Methods
- Kernel Trick
- Awesome Kernel-Based Methods
- Gallery of Kernels
- Gaussian Processes
- Conclusions / Pros and Cons of Kernel Methods
General Resources
- Asymptotic Statistics, Cambridge Series in Statistical and Probabilistic Mathematics No. 3
- Sorry that this one isn’t open access, but I was lucky enough to have access through my institution and it’s great. If you have institutional access or can afford it, I’d highly recommend it (as of 2025)
Intro
The goal of this blog post is to make you, the reader, aware of (and hopefully more appreciative of) kernel-based methods. To do that I’m going to structure the post as: 1. some linear-ish methods that show promise for being even better in non-linear contexts but seem computationally expensive, 2. how the kernel trick lets us get around these computational bottlenecks, and finally 3. the final form of the kernel-based methods, including those that simply don’t work without it (GPs).
I’m going to presume as little prior knowledge of these methods as possible, but this is going to be a bit of a whirlwind tour. So if any of them seem interesting to you and the level of detail I provide isn’t enough, I’ve included some independent resources for each sub-section (the ‘Resources’ sections) that should give another perspective or more detail.
And, as should be stated in all of my posts (but to be clear isn’t), I use notation that makes sense to me or that keeps things consistent within a given post, so it will likely differ from the standard notation for a given topic. If you think I should change the notation for a given idea or object, either to make it clearer or because it’s simply incorrect, please email me at lc[LastNamelowerCase]@[googleAddress].com or [FirstName].[Lastname]@[my institution].edu
Linear Methods
Fisher Linear Discriminant Analysis (KDA I)
Resources
- StatQuest: Linear Discriminant Analysis (LDA) clearly explained.
- Linear discriminant analysis - Wikipedia
- Basics of Quadratic Discriminant Analysis (QDA) - Kaggle
- FISHER’S DISCRIMINANT ANALYSIS - Sanjoy Das
- The Use Of Multiple Measurements In Taxonomic Problems - Fisher
The Gist
Fisher Linear Discriminant Analysis, or simply FLDA¹, is a supervised method (meaning we know the class labels) for creating a linear projection (a single value in 1D, a line in 2D, a plane in 3D) that separates two or more classes of objects. Here we will focus on separating just two classes.
The key idea behind FLDA is that you construct some linear combination of the input variables onto which you project the objects, and which demarcates the two classes. You do this by 1. maximising the distance (the variance) between the two groups in the projected space and 2. minimising the variance of each group in that space. Below are some examples of this in action, before we get into it, to emphasise that both conditions must be satisfied to get good discrimination.

In a very non-statistician way I’m just going to throw the formula out here and leave the derivation to another day (as I will do for quite a few things in this post).
\[\begin{align} Z = \frac{\sigma^2_{\textrm{between}}}{\sigma^2_{\textrm{within}}} = \frac{(\vec{w}\cdot(\vec{\mu}_1 - \vec{\mu}_0))^2}{\vec{w}^T\left(\Sigma_0 + \Sigma_1\right)\vec{w}} \end{align}\]Here \(\vec{w}\) is a vector in the direction of the line (or, more generally, a linear operator on the variables) that acts in the above as a projection operator, so that the statistical measures are taken on the projected line. There is an analytical solution to this, where
\[\begin{align} \vec{w} \propto (\Sigma_0 + \Sigma_1)^{-1}(\vec{\mu}_1 - \vec{\mu}_0). \end{align}\]The solution is only defined up to proportionality, since increasing or decreasing the magnitude of the direction vector still returns the same line. For the sake of a cool gif, and for later on when we generalise this method, let’s look at what happens when you try to optimise for \(\vec{w}\) numerically and compare it to the exact solution above.
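To make that concrete, here’s a minimal numpy sketch (my own illustration on made-up toy data, not the code behind the gif) that computes \(\vec{w}\) both ways: once with the closed-form expression and once by naive gradient ascent on \(Z\). The toy means, covariances, step size and iteration count are all arbitrary assumptions chosen just for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 0.5], size=(200, 2))  # class 0 samples
X1 = rng.normal(loc=[2.0, 1.5], scale=[1.0, 0.5], size=(200, 2))  # class 1 samples

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S0 = np.cov(X0, rowvar=False)  # covariance of class 0 (Sigma_0 above)
S1 = np.cov(X1, rowvar=False)  # covariance of class 1 (Sigma_1 above)

def fisher_criterion(w):
    """Z(w) = (w . (mu1 - mu0))^2 / (w^T (Sigma_0 + Sigma_1) w), as in the equation above."""
    between = (w @ (mu1 - mu0)) ** 2
    within = w @ (S0 + S1) @ w
    return between / within

# Exact solution: w proportional to (Sigma_0 + Sigma_1)^{-1} (mu_1 - mu_0)
w_exact = np.linalg.solve(S0 + S1, mu1 - mu0)
w_exact /= np.linalg.norm(w_exact)

def numerical_grad(f, w, eps=1e-6):
    """Central finite-difference gradient, to keep the sketch dependency-free."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

# Naive gradient ascent on Z(w); renormalise each step since only the
# direction of w matters (Z is invariant to rescaling w).
w = np.array([1.0, 0.0])
for _ in range(500):
    w = w + 0.05 * numerical_grad(fisher_criterion, w)
    w = w / np.linalg.norm(w)

print("exact direction    :", w_exact)
print("optimised direction:", w * np.sign(w @ w_exact))  # align signs for comparison
```

The two printed directions should agree closely, which is exactly the comparison the gif below makes visually.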
Let’s compare the optimisation result to my “guesses” above.






Support Vector Machines (SVM I)
Resources
Ridge Regression (KRR I)
Resources
Principal Component Analysis (Kernel PCA I)
Resources
The Kernel Trick
Resources
Awesome Kernel-Based Methods
Kernel Discriminant Analysis (KDA II)
Resources
Support Vector Machines (SVM II)
Resources
Kernel Ridge Regression (KRR II)
Resources
Kernel Principal Component Analysis (Kernel PCA II)
Resources
Gallery of Kernels
Resources
Gaussian Processes
Resources
- A Practical Guide to Gaussian Processes
- Interactive Gaussian Process Visualization
- Gaussian process regression demo
- Gaussian Processes for Machine Learning
Conclusions / Pros and Cons of Kernel Methods
¹ It annoys me to no end that Fisher Linear Discriminant Analysis and Linear Discriminant Analysis are commonly used interchangeably. Strictly, “Linear Discriminant Analysis” assumes homoscedasticity (the same covariance) for the two groups and that they follow normal distributions. It is for this reason that I gave up on finding a probabilistic derivation of FLDA, and I ain’t spending the time deriving it myself. Kernel Discriminant Analysis, as far as I can see, is based on Fisher LDA, hence I focus on that.